MLDB gives users full control over where and how data is persisted. MLDB handles multiple protocols for URLs (see Files and URLs). In this tutorial, we provide examples of loading files via http:// or https://, for files accessible on an HTTP server on the public internet or a private intranet.
To load a file from inside an MLDB container using file://, see the Loading Data Tutorial. MLDB also supports loading files from Amazon S3 and SFTP servers transparently. See the documentation for Files and URLs for more details.
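As a rough illustration of the protocols mentioned above, the following Python 3 sketch classifies a location by its URL scheme. The helper `describe_location` and its lookup table are hypothetical and only mirror the prose of this tutorial; the `s3` and `sftp` scheme spellings are assumed here. This is not MLDB's own code, and the Files and URLs documentation remains the authority on what is supported.

```python
from urllib.parse import urlparse

# Hypothetical lookup table that mirrors the prose above; it is not taken from MLDB.
SCHEME_DESCRIPTIONS = {
    "http": "file on an HTTP server",
    "https": "file on an HTTPS server",
    "file": "file inside the MLDB container",
    "s3": "file on Amazon S3",
    "sftp": "file on an SFTP server",
}

def describe_location(url):
    """Return a short description of where `url` points, based on its scheme."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in SCHEME_DESCRIPTIONS:
        raise ValueError("unrecognised scheme: %r" % scheme)
    return SCHEME_DESCRIPTIONS[scheme]

print(describe_location("http://snap.stanford.edu/data/facebook_combined.txt.gz"))
```

The point of the sketch is only that the scheme at the front of the URL is what selects the storage backend; the rest of the URL is interpreted by that backend.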
The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [2]:
from pymldb import Connection
mldb = Connection()
MLDB makes it very easy to load data from a public web server, since a file location can be specified using a remote URI. To illustrate this, we have chosen to load a file from the Facebook Social Circles dataset, hosted by the Stanford Network Analysis Project (SNAP), which provides many public datasets.
We will simply import the file http://snap.stanford.edu/data/facebook_combined.txt.gz using the import.text procedure. Notice that not only is the file hosted on a remote server, but it is also compressed. MLDB will decompress it seamlessly as it's being downloaded.
In [3]:
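dataUrl = "http://snap.stanford.edu/data/facebook_combined.txt.gz"

# Create the import procedure and run it immediately
print(mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        "dataFileUrl": dataUrl,          # remote, gzip-compressed file
        "headers": ["node", "edge"],     # column names to assign
        "delimiter": " ",                # values are space-separated
        "quoteChar": "",                 # no quote character
        "outputDataset": "import_URL1",
        "runOnCreation": True
    }
}))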
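To make the idea of decompressing and splitting the file more concrete, here is a minimal Python 3 sketch of the same two steps done by hand on an in-memory sample. The function `parse_gzipped_text` and the sample bytes are made up for illustration; this is not how MLDB implements import.text, only a picture of what the `headers` and `delimiter` parameters mean.

```python
import gzip
import io

def parse_gzipped_text(raw_bytes, headers, delimiter=" "):
    """Decompress gzip bytes and split each line into a dict keyed by `headers`."""
    rows = []
    with gzip.open(io.BytesIO(raw_bytes), mode="rt") as lines:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            values = line.split(delimiter)
            rows.append(dict(zip(headers, values)))
    return rows

# Made-up sample standing in for the downloaded file.
sample = gzip.compress(b"0 1\n0 2\n1 48\n")
for row in parse_gzipped_text(sample, headers=["node", "edge"]):
    print(row)
```

Note that the sketch keeps every value as a string; it makes no attempt to reproduce whatever type handling import.text applies.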
We can now take a look:
In [5]:
mldb.query("SELECT * FROM import_URL1 LIMIT 5")
Out[5]:
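MLDB can also read a single file from inside a remote archive: prefix the URL with archive+ and append # followed by the path of the file within the archive.
In [4]: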
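dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"

# Import one file from inside the remote archive
print(mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        # archive+<archive URL>#<path of the file inside the archive>
        "dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
        "headers": ["circles"],          # single column name to assign
        "delimiter": " ",
        "quoteChar": "",                 # no quote character
        "outputDataset": "import_URL2",
        "runOnCreation": True
    }
}))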
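The URL used above packs two pieces of information into one string: the location of the archive and the path of one file inside it. The Python 3 sketch below shows one way to picture that, by splitting such a string and reading a member from a small in-memory .tar.gz. The helpers `split_archive_url`, `read_member` and `make_sample_archive` are hypothetical, the archive contents are made up, and none of this is MLDB's actual implementation.

```python
import io
import tarfile

def split_archive_url(url):
    """Split 'archive+<archive URL>#<member path>' into (archive URL, member path)."""
    prefix = "archive+"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with 'archive+'")
    archive_url, sep, member = url[len(prefix):].partition("#")
    if not sep or not member:
        raise ValueError("expected '#<member path>' after the archive URL")
    return archive_url, member

def read_member(archive_bytes, member):
    """Return the bytes of one member of an in-memory .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return tar.extractfile(member).read()

def make_sample_archive():
    """Build a tiny in-memory .tar.gz with made-up contents for the demo."""
    buffer = io.BytesIO()
    payload = b"circle0 1 2 3\n"
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="facebook/3980.circles")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()

archive_url, member = split_archive_url(
    "archive+http://snap.stanford.edu/data/facebook.tar.gz#facebook/3980.circles")
print(archive_url, member)
print(read_member(make_sample_archive(), member))
```

The sketch works on a locally built archive so that it runs without network access; it does not download the SNAP file.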
Let's query our dataset to see what the data looks like:
In [5]:
mldb.query("SELECT * FROM import_URL2 LIMIT 5")
Out[5]:
The next step would be to format the data so that we can query it easily. This is shown in the Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial, where we structure the data in a nicer way.
Check out the other Tutorials and Demos.